CBG-3271 allow in memory buckets to persist #16

Merged · 5 commits · Nov 16, 2023
Conversation

@torcolvin (Collaborator) commented Nov 8, 2023

  • in memory buckets will now live for the lifecycle of the program until Bucket.CloseAndDelete is called. This facilitates bucket closing without removing the data, which is a common use in Sync Gateway tests that use persistent config to update database configuration.
  • When a bucket is first created, a copy is stored in the bucket registry, and this is never closed until:
    • in memory: CloseAndDelete is called
    • on disk: Close is called to bring refcount of buckets to 0
  • force the bucket name to be defined, and do not let multiple copies of a persistent bucket be opened if they have different paths. Note, this is a blunt instrument, and it is indiscriminate about path manipulations: it matches paths lexicographically, not as normalized paths. The idea is to be safer. Sync Gateway is not architected to support multiple buckets with the same name that do not have the same backing data.
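The registry behavior described above can be sketched as follows. This is a minimal illustration, not rosmar's actual code; the type and method names (`bucketRegistry`, `open`, `closeAndDelete`, and the `buckets`/`bucketCount` fields) are assumptions based on the discussion below.

```go
package main

import (
	"fmt"
	"sync"
)

// Bucket is a stand-in for rosmar's Bucket type.
type Bucket struct {
	name     string
	inMemory bool
}

// bucketRegistry keeps one canonical *Bucket per name plus a refcount,
// so closing a copy never tears down shared state prematurely.
type bucketRegistry struct {
	mu          sync.Mutex
	buckets     map[string]*Bucket // canonical copy per bucket name
	bucketCount map[string]int     // refcount of open copies per name
}

func (r *bucketRegistry) open(name string, inMemory bool) *Bucket {
	r.mu.Lock()
	defer r.mu.Unlock()
	b, ok := r.buckets[name]
	if !ok {
		b = &Bucket{name: name, inMemory: inMemory}
		r.buckets[name] = b
	}
	r.bucketCount[name]++
	return b
}

// close drops a reference. An on-disk bucket leaves the registry when its
// refcount hits zero; an in-memory bucket stays until closeAndDelete.
func (r *bucketRegistry) close(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.bucketCount[name]--
	if r.bucketCount[name] <= 0 && !r.buckets[name].inMemory {
		delete(r.buckets, name)
		delete(r.bucketCount, name)
	}
}

// closeAndDelete removes the bucket unconditionally, deleting the data
// of an in-memory bucket.
func (r *bucketRegistry) closeAndDelete(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.buckets, name)
	delete(r.bucketCount, name)
}

func main() {
	r := &bucketRegistry{
		buckets:     map[string]*Bucket{},
		bucketCount: map[string]int{},
	}
	b1 := r.open("b", true)
	b2 := r.open("b", true)
	fmt.Println(b1 == b2) // both opens return the same canonical bucket
	r.close("b")
	r.close("b")
	fmt.Println(len(r.buckets)) // in-memory data survives Close
	r.closeAndDelete("b")
	fmt.Println(len(r.buckets)) // gone only after CloseAndDelete
}
```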

Implementation:

The global state of a bucket consists of two things:

  • *sql.DB representing the underlying sqlite connection
  • dcpFeeds for each connection that exists

The sql.DB connection can be opened multiple times on the same path, and this was the original implementation of rosmar. However, it can't be opened multiple times for in-memory databases except via the cache=shared query parameter to sqlite3_open. This ended up causing behavior that I didn't understand, and it is not typically supported in sqlite, since multiple connections are only coordinated via a WAL when the database is on disk.
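For reference, a named in-memory SQLite database is shared between connections only when the DSN carries cache=shared; a plain mode=memory DSN gives each connection its own private database. The helper below only builds the two DSN shapes (the function name is hypothetical; no driver is opened here):

```go
package main

import (
	"fmt"
	"net/url"
)

// inMemoryDSN builds a sqlite URI filename for a named in-memory
// database. With shared=true the cache=shared parameter is added, which
// is what lets a second connection see the same data.
func inMemoryDSN(name string, shared bool) string {
	q := url.Values{"mode": {"memory"}}
	if shared {
		q.Set("cache", "shared")
	}
	return fmt.Sprintf("file:%s?%s", name, q.Encode())
}

func main() {
	fmt.Println(inMemoryDSN("bucket1", true))  // shared between connections
	fmt.Println(inMemoryDSN("bucket1", false)) // private per connection
}
```

This PR sidesteps the cache=shared behavior entirely by keeping a single canonical *sql.DB in the registry instead of opening the in-memory database more than once.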

DCP feeds in rosmar work by pushing events on a CUD operation to a queue, which can be read by any running feeds. Instead of having separate feeds for each copy of Bucket and publishing them via bucketsAtUrl, we now only have a single canonical set of bucket feeds. This field moved from Collection to Bucket. This addresses https://issues.couchbase.com/browse/CBG-3540
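The single-queue idea can be sketched like this (all names here are illustrative, not rosmar's actual types): every copy of a bucket publishes CUD events to one shared set of feed channels held on the canonical Bucket, rather than each copy owning its own feeds.

```go
package main

import "fmt"

// event is a stand-in for a DCP mutation/deletion event.
type event struct{ key string }

// bucket holds the single canonical set of feeds; every copy of the
// bucket shares a pointer to this state, so all feeds see every event.
type bucket struct {
	feeds []chan event // one entry per running DCP feed
}

// publish fans a CUD event out to every running feed.
func (b *bucket) publish(e event) {
	for _, f := range b.feeds {
		f <- e
	}
}

func main() {
	b := &bucket{}
	feed := make(chan event, 8)
	b.feeds = append(b.feeds, feed)
	b.publish(event{key: "doc1"})
	fmt.Println((<-feed).key)
}
```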

Whether a bucket is open or closed is controlled by Bucket._db(), which is called by all CRUD operations. A Collection has a pointer to its parent bucket. Each newly opened Bucket creates its Collections dynamically, but these share pointers to the cached in-memory versions.
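A minimal sketch of that gate, assuming simplified types (the real Bucket and Collection carry much more state, and the `get` method here is hypothetical):

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// Bucket owns the shared sqlite handle and the per-copy closed flag.
type Bucket struct {
	closed bool
	db     *sql.DB
}

// _db is the single gate every CRUD path goes through: once this copy of
// the bucket is closed, all operations fail fast.
func (b *Bucket) _db() (*sql.DB, error) {
	if b.closed {
		return nil, errors.New("bucket has been closed")
	}
	return b.db, nil
}

// Collection holds only a pointer back to its parent Bucket.
type Collection struct {
	bucket *Bucket
}

func (c *Collection) get(key string) error {
	_, err := c.bucket._db() // open/closed check happens here
	if err != nil {
		return err
	}
	// ... run the actual SQL query here ...
	return nil
}

func main() {
	b := &Bucket{}
	c := &Collection{bucket: b}
	fmt.Println(c.get("doc1")) // succeeds while open
	b.closed = true
	fmt.Println(c.get("doc1")) // errors once closed
}
```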

Depends on couchbase/sg-bucket#110

}
// Copy the slice before mutating, in case a client is iterating it:
Contributor:

Whereabouts do we iterate over the slice without holding the lock?

Collaborator (Author):

This code was carried over from the original implementation; I've simplified it for refcounting.

bucket_api.go Outdated
bucket.closed = true
}

func (bucket *Bucket) _closeAllInstances() {
Contributor:

What are the 'AllInstances' here, since it looks like it's just closing this single bucket's connection?

Collaborator (Author):

Renamed this closeSQliteDB, although it really closes more than that: if this function is called on any bucket, it will shut down all Buckets with a matching name. Any suggestion of a better name would be welcome.

collections: make(map[sgbucket.DataStoreNameImpl]*Collection),
collectionFeeds: make(map[sgbucket.DataStoreNameImpl][]*dcpFeed),
mutex: &sync.Mutex{},
nextExp: new(uint32),
Contributor:

I'm not familiar with this usage - what's the advantage of setting new(uint32) instead of leaving nil?

Collaborator (Author):

This is just initializing an int pointer to 0; is there a more canonical way without writing a function?
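To illustrate the point under discussion: `new(uint32)` is idiomatic Go for a zero-valued pointer field, since it allocates the uint32, zeroes it, and returns a pointer that is immediately safe to dereference, whereas a nil pointer would panic on first use.

```go
package main

import "fmt"

func main() {
	// new(uint32) allocates a zeroed uint32 and returns its address,
	// so the field can be dereferenced (or passed to atomic ops)
	// right away.
	p := new(uint32)
	fmt.Println(*p) // 0

	// A declared-but-unset pointer is nil; dereferencing it would panic.
	var q *uint32
	fmt.Println(q == nil) // true
}
```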

var bucketRegistry = map[string][]*Bucket{} // Maps URL to slice of Buckets at that URL
var bucketRegistryMutex sync.Mutex // Thread-safe access to bucketRegistry
type bucketRegistry struct {
byURL map[string][]*Bucket
Contributor:

Does this need to be a []*Bucket, or is it just used for ref counting and so could be something like []bool or []struct{}?

Collaborator (Author):

I replaced this with integer refcounting.

var bucketRegistryMutex sync.Mutex // Thread-safe access to bucketRegistry
type bucketRegistry struct {
byURL map[string][]*Bucket
inMemoryRef map[string]*Bucket
Contributor:

As discussed, the names are a bit confusing if 'byURL' is actually the ref count, and 'inMemoryRef' is the lookup of bucket by URL.

Collaborator (Author):

I picked the names bucketCount for the refcount and buckets for the actual cached copy of the bucket.

- create a copy of the bucket in getCachedBucket so caller doesn't have
  to remember to make a copy
@adamcfraser (Contributor) left a comment:

Looks good to me. One suggestion for additional documentation on the bucket registry as discussed, but otherwise looks good.

// `realpath` or `fcntl(F_GETPATH)`) and use the canonical path as the key.
// Unfortunately Go doesn't seem to have an API for that.
// * In memory bucket: the bucket is not deleted until CloseAndDelete is called on any Bucket.
// * On disk bucket: the bucket is deleted from the registry when there are no open copies in memory. Unlike an in-memory bucket, the bucket stays persisted on disk and can be reopened.
Contributor:

As discussed, a bit more context on how buckets are managed here will make this easier to maintain. In particular I'm thinking of explicitly documenting:

  • what's shared between bucket copies (almost everything)
  • what's not shared, and why (in particular 'closed')

@torcolvin torcolvin merged commit adb4806 into main Nov 16, 2023
12 checks passed
@torcolvin torcolvin deleted the CBG-3271 branch November 16, 2023 23:23